Shortest-Path Graph Kernels for Document Similarity

Authors

  • Giannis Nikolentzos
  • Polykarpos Meladianos
  • François Rousseau
  • Yannis Stavrakas
  • Michalis Vazirgiannis
Abstract

In this paper, we present a novel document similarity measure based on the definition of a graph kernel between pairs of documents. The proposed measure takes into account both the terms contained in the documents and the relationships between them. By representing each document as a graph-of-words, we are able to model these relationships and then determine how similar two documents are by using a modified shortest-path graph kernel. We evaluate our approach on two tasks and compare it against several baseline approaches using performance metrics such as DET curves and macro-average F1-score. Experimental results on a range of datasets show that our proposed approach outperforms traditional techniques and measures the similarity between two documents more accurately.
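
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the graph-of-words construction or the modifications made to the shortest-path kernel, so the sliding-window size, the inverse-path-length weighting, and the cosine-style normalization below are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import deque
import math


def graph_of_words(tokens, window=2):
    """Build an undirected graph: terms are nodes, and two terms are linked
    if they co-occur within `window` consecutive tokens (assumed construction)."""
    adj = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != t:
                adj[t].add(u)
                adj[u].add(t)
    return adj


def shortest_path_lengths(adj):
    """Run a BFS from every node; return {(u, v): hops} for reachable pairs u < v."""
    dist = {}
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen[nb] = seen[node] + 1
                    queue.append(nb)
        for dst, d in seen.items():
            if src < dst:
                dist[(src, dst)] = d
    return dist


def sp_kernel(tokens_a, tokens_b, window=2):
    """Unnormalized kernel: sum over term pairs connected in both graphs,
    each weighted by the product of inverse path lengths (assumed weighting)."""
    da = shortest_path_lengths(graph_of_words(tokens_a, window))
    db = shortest_path_lengths(graph_of_words(tokens_b, window))
    return sum(1.0 / (da[p] * db[p]) for p in da.keys() & db.keys())


def similarity(tokens_a, tokens_b, window=2):
    """Cosine-style normalization so the score lies in [0, 1]."""
    kaa = sp_kernel(tokens_a, tokens_a, window)
    kbb = sp_kernel(tokens_b, tokens_b, window)
    if kaa == 0 or kbb == 0:
        return 0.0
    return sp_kernel(tokens_a, tokens_b, window) / math.sqrt(kaa * kbb)


doc_a = "graph kernels measure document similarity".split()
doc_b = "graph kernels compare document structure".split()
print(similarity(doc_a, doc_b))
```

Unlike a bag-of-words comparison, this score rewards two documents only when they share pairs of terms that are connected in both graphs, and it rewards them more when those terms sit close together in each document.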

Similar resources

CLAIRLIB Documentation v1.03

The Clair library is intended to simplify a number of generic tasks in Natural Language Processing (NLP), Information Retrieval (IR), and Network Analysis. Its architecture also allows for external software to be plugged in with very little effort. Functionality native to Clairlib includes Tokenization, Summarization, LexRank, Biased LexRank, Document Clustering, Document Indexing, PageRank, Bi...

A Graph Based Authorship Identification Approach: Notebook for PAN at CLEF 2015

The paper describes our approach for the Authorship Identification task at the PAN CLEF 2015. We extract textual patterns based on features obtained from shortest path walks over Integrated Syntactic Graphs (ISG). Then we calculate a similarity between the unknown document and the known document with these patterns. The approach uses a predefined threshold in order to decide if the unknown docu...

A Shortest Path Similarity Matrix based Spectral Clustering

This paper proposed a new spectral graph clustering model by casting the non-categorical spatial data sets into an undirected graph. Decomposition of the graph to Delaunay graph has been done for computational efficiency. All pair shortest path based model has been adapted for the creation of the underlying Laplacian matrix of the graph. The similarity among the nodes of the graph is measured b...

Generalized Shortest Path Kernel on Graphs

We consider the problem of classifying graphs using graph kernels. We define a new graph kernel, called the generalized shortest path kernel, based on the number and length of shortest paths between nodes. For our example classification problem, we consider the task of classifying random graphs from two well-known families, by the number of clusters they contain. We verify empirically that the ...

A Shortest Path Dependency Kernel for Relation Extraction

We present a novel approach to relation extraction, based on the observation that the information required to assert a relationship between two named entities in the same sentence is typically captured by the shortest path between the two entities in the dependency graph. Experiments on extracting top-level relations from the ACE (Automated Content Extraction) newspaper corpus show that the new...

Publication date: 2017